Text analysis is a family of quantitative methods for making unstructured text, such as reviews, open-ended survey responses, or customer tickets, usable for analysis. Text analysis results are typically observational, but they can provide useful exploratory insights and support hypothesis generation.
1. Define the question to be addressed
2. Obtain text data after evaluating data sources, quality, and any ethical considerations
3. Conduct exploratory data analyses
4. Apply text analysis techniques, such as (but not limited to):
   - Sentiment analysis to quantify positive and negative linguistic sentiment
   - Term frequency and keyword extraction to examine important terms
   - Topic modeling or cluster analyses to summarize themes
   - Text classification to categorize text into labels of interest
5. Interpret and visualize results
6. Formulate insights
See all software references at the end of the tutorial.
#load packages and comment their uses
library(tidyverse) #data cleaning, organization, and visualization
library(psych) #summarize descriptive statistics and distributions
library(stringr) #string manipulation and regular expressions
library(skimr) #summarize descriptive statistics and distributions
library(tidytext) #sentiment analysis, text cleaning, and word frequency
library(textclean) #text cleaning and word frequency
library(wordcloud) #text cleaning and word frequency
library(tm) #word association
library(vader) #sentiment analysis
library(topicmodels) #topic modeling
library(MetBrewer) #plot color palettes
# color palette
MetBrew_Egypt <- MetBrewer::met.brewer("Egypt", n = 5)
MetBrew_Tam <- MetBrewer::met.brewer("Tam", n = 15)
Data source: a random subset of 10,000 reviews (about 2% of the full dataset) from the Amazon Fine Foods reviews dataset (McAuley & Leskovec, 2013). Learn more and download the data here: https://snap.stanford.edu/data/web-FineFoods.html
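For reference, the raw file stores each review as a block of `key: value` lines, one field per line; schematically (placeholder values, abbreviated to the fields handled below):

```
product/productId: <product id>
review/userId: <user id>
review/profileName: <display name>
review/score: 5.0
review/time: <unix timestamp>
review/summary: <short title>
review/text: <full review text>
```

The cleaning pipeline below splits on these keys, pivots the key–value pairs wide to one row per review, and keeps only the ID, score, and text fields.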
#read data downloaded from Amazon Fine Foods reviews source
df_raw <- read.delim("../foods.txt", header = F) #load raw data saved to wd
df_raw$V1 <- iconv(df_raw$V1, from = "", to = "UTF-8") #convert all text to UTF-8
#cleaning raw dataframe to create one column per item within V1, one row per review using regex and dplyr
df <- df_raw %>%
dplyr::mutate(V1 = str_split(V1, "\\n")) %>%
unnest(V1) %>%
dplyr::mutate(V1 = str_trim(V1)) %>%
dplyr::filter(V1 != "") %>%
dplyr::mutate(V1 = str_split(V1, "(?=review/)|(?=product/)")) %>% #split before each "review/" or "product/" key
unnest(V1) %>%
dplyr::mutate(V1 = str_trim(V1)) %>%
dplyr::filter(V1 != "") %>%
dplyr::mutate(names = str_extract(V1, "^review/\\w+|^product/\\w+"),
values = str_remove(V1, "^review/\\w+:\\s*|^product/\\w+:\\s*")) %>%
select(names, values) %>%
dplyr::mutate(review_id = as.numeric(cumsum(names == "product/productId"))) %>%
pivot_wider(names_from = names, values_from = values) %>%
select(-c("review/time", "review/summary", "review/profileName")) %>%
rename("productId" = "product/productId",
"userId" = "review/userId",
"score" = "review/score",
"text" = "review/text") %>%
dplyr::mutate(userId = match(userId, sample(unique(userId))),
productId = match(productId, sample(unique(productId)))) %>%
select(c("review_id", "productId", "userId", "score", "text")) %>%
as.data.frame()
#save new ids as categorical variables
df$review_id <- as.factor(df$review_id)
df$userId <- as.factor(df$userId)
df$productId <- as.factor(df$productId)
rm(df_raw) #remove giant raw dataset from environment
set.seed(22) #set seed to sample reproducibly
df <- df %>%
dplyr::slice_sample(n = 10000) %>% #take a random sample of 10,000 reviews
unnest(c(2:5)) #unnest list columns created by pivot_wider
#replace "NULL" with NA
df[df=="NULL"] <- NA
#write.csv(df, "./foods_small.csv") ## optional: save small top 10000 reviews for future use
#df <- read.csv("./foods_small.csv") %>% select(-c(X)) #optional: read in saved data to save future pre-processing time
#examine text data
knitr::kable(skim(df)) #overview of data
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | factor.ordered | factor.n_unique | factor.top_counts |
|---|---|---|---|---|---|---|---|---|---|---|---|
| character | score | 3 | 0.9997000 | 3 | 3 | 0 | 5 | 0 | NA | NA | NA |
| character | text | 19 | 0.9981002 | 33 | 6839 | 0 | 9814 | 0 | NA | NA | NA |
| factor | review_id | 0 | 1.0000000 | NA | NA | NA | NA | NA | FALSE | 10000 | 311: 2, 4: 1, 16: 1, 45: 1 |
| factor | productId | 0 | 1.0000000 | NA | NA | NA | NA | NA | FALSE | 3525 | 358: 82, 949: 80, 756: 76, 682: 72 |
| factor | userId | 0 | 1.0000000 | NA | NA | NA | NA | NA | FALSE | 9280 | 484: 7, 495: 7, 527: 7, 165: 6 |
Takeaway: We should have 5 columns in the “df” dataframe, representing the random subset of 10,000 Amazon Fine Foods reviews we will use for this tutorial:
- review_id (factor): review identifier (created during data cleaning based on row number)
- productId (factor): product identifier for the product being reviewed (replaced with a sequential number ID during data cleaning)
- userId (factor): reviewer identifier (replaced with a sequential number ID during data cleaning)
- score (character): number of stars given by the reviewer
- text (character): complete review text
Question to be addressed: What are some common customer experiences and pain points with Amazon Fine Foods products?
#descriptive stats
psych::describe(df$score) #describe scores
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1* 1 9998 4.14 1.32 5 4.41 0 1 5 4 -1.36 0.46 0.01
#plot scores
df %>%
drop_na(score) %>%
ggplot(aes(x = as.factor(score))) +
geom_bar(color = "darkgray", fill = "gray") +
labs(title = "Frequency of review scores",
x = "Score", y = "Frequency") +
theme_classic()
Takeaway: Most reviews have 5 stars.
df %>%
count(productId) %>%
psych::describe() #describe number of reviews per product
## vars n mean sd median trimmed mad min max range skew
## productId* 1 3525 4867.68 2815.71 4853 4861.48 3660.54 1 9768 9767 0.01
## n 2 3525 2.84 5.76 1 1.62 0.00 1 82 81 6.93
## kurtosis se
## productId* -1.20 47.43
## n 62.69 0.10
#plot number of reviews per product
df %>%
count(productId) %>%
ggplot(aes(x = n)) +
geom_histogram(color = "darkgray", fill = "gray", binwidth = 5) +
labs(title = "Frequency of reviews per product",
x = "Review count per product", y = "Frequency") +
theme_classic()
Takeaway: Most products have only one or a few reviews (median = 1); a small number of products are reviewed much more heavily (up to 82 times).
#examine whether review scores vary among highly reviewed and non-highly-reviewed products
df <- df %>%
add_count(productId, name = "n_reviews") %>% #add column with count for number of reviews
dplyr::group_by(productId) %>%
dplyr::mutate(highly_reviewed = ifelse(n_reviews > 2.84, "highly-reviewed", "not highly-reviewed")) %>% #create new column denoting if product is above/below mean of n reviews
dplyr::ungroup()
psych::describeBy(df$score, df$highly_reviewed) #examine whether scores differ on number of reviews
##
## Descriptive statistics by group
## group: highly-reviewed
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1* 1 6684 5.13 1.29 6 5.39 0 1 6 5 -1.33 0.47 0.02
## ------------------------------------------------------------
## group: not highly-reviewed
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1* 1 3314 4.15 1.38 5 4.44 0 1 5 4 -1.4 0.44 0.02
#plot scores by high/low review
df %>%
drop_na(score) %>%
ggplot(aes(x = as.factor(score))) +
geom_bar(color = "darkgray", fill = "gray") +
labs(title = "Frequency of review scores for highly-reviewed and not-highly-reviewed products",
x = "Score", y = "Frequency") +
theme_classic() +
facet_wrap(~highly_reviewed)
Takeaway: Products that received more reviews than the mean and products that received fewer don’t differ substantially in terms of the distributions of stars given.
df %>%
count(userId) %>%
psych::describe() #describe number of reviews per reviewer
## vars n mean sd median trimmed mad min max range
## userId* 1 9280 28765.37 16478.39 28659 28773.47 21057.37 9 57220 57211
## n 2 9280 1.08 0.35 1 1.00 0.00 1 7 6
## skew kurtosis se
## userId* 0.01 -1.20 171.06
## n 6.50 60.94 0.00
#plot number of reviews per reviewer
df %>%
count(userId) %>%
ggplot(aes(x = n)) +
geom_histogram(color = "darkgray", fill = "gray", binwidth = 1) +
labs(title = "Frequency of reviews per reviewer",
x = "Review count per reviewer", y = "Frequency") +
theme_classic()
Takeaway: Most reviewers reviewed only once in this subset.
#add wordcount column
df <- df %>%
as.data.frame() %>%
dplyr::group_by(review_id) %>%
dplyr::mutate(review_wordcount = str_count(text, pattern = "\\w+")) %>% #add wordcount column using regex
dplyr::ungroup()
df$review_wordcount <- as.numeric(df$review_wordcount)
#examine results
psych::describe(df$review_wordcount) #describe wordcount
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 9982 84.64 82.7 60 68.95 43 6 1318 1312 3.87 27.9 0.83
hist(df$review_wordcount, breaks = 100) #wordcount histogram
Takeaway: Most reviews are relatively short (median = 60 words), with a long right tail of longer reviews.
#plot review score and review length
df %>%
subset(review_wordcount <= (84.64+(3*82.70))) %>% #subset to remove review length outliers more than 3 SD from average review length
ggplot(aes(y = as.numeric(review_wordcount), x = as.factor(score), color = as.factor(score), fill = as.factor(score))) +
geom_jitter(size = 1, color = "gray") +
geom_violin(alpha = 0.7) +
scale_fill_manual(values = MetBrew_Egypt) +
scale_color_manual(values = MetBrew_Egypt) +
geom_boxplot(width = 0.1, color = "white") +
labs(x = "Review Score", y = "Review wordcount") +
theme_classic() +
theme(legend.position = "none")
Takeaway: Review scores do not appear to vary much with length.
Text cleaning can reduce noise and improve the accuracy and efficiency of some analyses, but it should be applied only when appropriate for the method at hand.
Here, we will clean the text by lowercasing it, trimming extra whitespace, removing numeric characters, and stripping HTML tags.
We will also create a further-cleaned version of the text data, needed for some analyses, with contractions replaced, punctuation removed, words lemmatized, text tokenized into words, and stopwords removed.
A few best practices:
- Refer to best practices and documentation for specific analyses on which text cleaning steps are recommended (if any).
- Consider whether an analysis is compositional (i.e., takes context into account vs. considering each word individually) before implementing text cleaning steps.
- Sensitivity analyses can be a great tool to examine whether results change with and without text cleaning.
- Be transparent about text cleaning steps taken when sharing results.
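The sensitivity-analysis idea can be made concrete: for example, compare the most frequent terms with and without stopword removal. A minimal sketch (illustrative only; it assumes the review dataframe `df` built earlier and falls back to toy data when run standalone):

```r
library(tidyverse)
library(tidytext)

if (!exists("df")) { #toy fallback so the sketch runs standalone
  df <- data.frame(text = c("the soup was good", "the dog loved the treats"))
}

#tokenize once, then count terms with and without snowball stopwords
tokens <- df %>%
  tidytext::unnest_tokens(word, text, token = "words") %>%
  dplyr::filter(!is.na(word))

top_raw <- tokens %>%
  dplyr::count(word, sort = TRUE) %>%
  head(10)
top_nostop <- tokens %>%
  dplyr::filter(!word %in% subset(stop_words, lexicon == "snowball")$word) %>%
  dplyr::count(word, sort = TRUE) %>%
  head(10)

#little overlap would suggest results are sensitive to this cleaning step
intersect(top_raw$word, top_nostop$word)
```

A large divergence between the two top-term lists is a signal to report results both ways.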
#lowercase text, remove extra whitespace, remove numbers, and strip HTML tags
df$text <- tolower(df$text) #place all text in lowercase
df$text <- trimws(df$text) #remove extra whitespace
df$text <- textclean::replace_number(df$text, remove = TRUE) #remove numeric characters (remove = TRUE deletes them rather than spelling them out)
df$text <- gsub("[0-9]+", "", df$text) #remove any remaining digits
df$text <- gsub("<[^>]*>", "", df$text) #remove HTML tags contained in < >
#see results
knitr::kable(head(df[, c("review_id", "text", "review_wordcount")], n = 5))
| review_id | text | review_wordcount |
|---|---|---|
| 24421 | i have never been much of a soup person, unless down with a cold or too tired to cook up something fancy. once heated, i crumbled tortilla chips on top (or even some shredded cheese). since i love mexican food, this satisfied me for the evening. i was advised by progresso that they have discontinued this soup. thank goodness amazon had some for purchasing. | 64 |
| 39808 | the bars make for a nice snack and they taste natural but are not something that i would eat for pleasure. | 21 |
| 56240 | we really enjoyed this product – containes no sugar (contains dates and such) and it’s healthy while giving you the chocolate fix you need! | 24 |
| 70928 | the toy is excellent. my golden retriever is teething and this has been perfect for her.-stars because product description is not accurate. picture showed t-rex, but i received a brontosaurus. no big deal but if there are multiple models the description should indicate which one you’re getting | 53 |
| 77000 | type of pasta: orzo, it resembles more like rice than pasta, so this dish felt like a chicken,cheesy rice,broccoli type of dish as opposed to straight pasta.resembles: hamburger helper’s but much cheesierdifficulty: very easytime: minutesa box fills up: about people (maybe more if you eat less)amount of broccoli: very little, wish they had more! but yes, very little, though you can taste it. basically very small little flecks of freeze dried broccoli.amount of cheese: good amount of cheese for people!taste: b+ease to make: anutrition: c+/b-overall: b | 120 |
df_words <- df #copy df for word-level cleaning (contractions, punctuation, lemmatization)
df_words$text <- textclean::replace_contraction(df_words$text) #replace contractions
df_words$text <- gsub("[[:punct:]]", "", df_words$text) #remove remaining punctuation
df_words$text <- textstem::lemmatize_strings(df_words$text) #lemmatize words
#tokenize
df_words <- df_words %>%
tidytext::unnest_tokens(word, text, token = "words") %>%
dplyr::filter(!is.na(word))
stopwords <- subset(stop_words, lexicon == "snowball") #select stopword lexicon
#remove stopwords from data
df_words <- df_words %>%
dplyr::filter(!word %in% stopwords$word) %>%
dplyr::filter(!is.na(word))
#see results
knitr::kable(head(df_words[, c("review_id", "word")], n = 5)) #top 5 rows
| review_id | word |
|---|---|
| 24421 | never |
| 24421 | much |
| 24421 | soup |
| 24421 | person |
| 24421 | unless |
#frequent word count dataframe
df_count <- df_words %>%
count(word, sort = TRUE)
#show top frequent words table
knitr::kable(head(df_count, n = 25))
| word | n |
|---|---|
| good | 7249 |
| like | 5261 |
| much | 4943 |
| can | 4746 |
| taste | 4636 |
| flavor | 3735 |
| get | 3336 |
| one | 3335 |
| will | 3214 |
| love | 3181 |
| product | 3121 |
| make | 3113 |
| just | 3086 |
| try | 2999 |
| use | 2901 |
| great | 2838 |
| buy | 2754 |
| coffee | 2742 |
| food | 2660 |
| tea | 2454 |
| find | 2339 |
| dog | 2317 |
| eat | 2297 |
| little | 2007 |
| go | 1883 |
## frequent words wordcloud
wordcloud(df$text, min.freq = 500, colors = brewer.pal(12, "Dark2"))
#compare most frequent words among high and low scored reviews
df_stars <- df %>%
dplyr::select(c(review_id, score)) %>%
dplyr::mutate(high_low_score = ifelse(score == "1.0" | score == "2.0", "low",
ifelse(score == "5.0", "high", NA))) #create high/low review score column
#plot top 25 words in high and low score reviews
df_words %>%
left_join(., df_stars, by = c("review_id")) %>%
subset(!is.na(high_low_score)) %>%
dplyr::group_by(high_low_score) %>%
count(word, sort = TRUE) %>%
top_n(n = 25) %>%
dplyr::mutate(word = reorder(word, n)) %>%
dplyr::ungroup() %>%
ggplot(aes(n, reorder_within(word, n, high_low_score))) +
geom_col(color = "gray", fill = "darkgray") +
labs(y = "Word", x = "Frequency ", title = "Amazon fine foods reviews subset top 25 most frequent words",
subtitle = "Among high-scoring (5 star) and low-scoring (1 or 2 star) reviews") +
geom_text(aes(label = n), hjust = 1, colour = "white") +
theme_classic() +
facet_wrap(~high_low_score, scales = "free")
## Selecting by n
Takeaway: The top few words across all reviews are “good,” “like,” “much,” “can,” and “taste”. 5-star reviews and 1- and 2-star reviews feature fairly similar words, although each word is taken out of context here.
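Raw frequency treats every review equally, so common-but-uninformative words dominate. As a possible extension (not run in this tutorial), `tidytext::bind_tf_idf()` can surface terms that are distinctive to individual reviews rather than common everywhere. A sketch, assuming the tokenized `df_words` from above (with a toy fallback for standalone runs):

```r
library(tidyverse)
library(tidytext)

if (!exists("df_words")) { #toy fallback so the sketch runs standalone
  df_words <- data.frame(review_id = c(1, 1, 2, 2),
                         word = c("apple", "good", "banana", "good"))
}

#tf-idf: upweight words frequent in one review but rare across reviews
df_tfidf <- df_words %>%
  dplyr::count(review_id, word) %>%
  tidytext::bind_tf_idf(word, review_id, n) %>%
  dplyr::arrange(dplyr::desc(tf_idf))

head(df_tfidf, 10) #most review-distinctive terms
```

Words appearing in every review get a tf-idf of zero, so this view complements the plain frequency tables above.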
#frequent word count with raw text
df_count2 <- df %>%
tidytext::unnest_tokens(word, text, token = "words") %>% #redo tokenization on raw text without cleaning
dplyr::filter(!is.na(word)) %>%
count(word, sort = TRUE) #count occurrences of raw words
#top frequent words table
knitr::kable(head(df_count2, n = 25)) #show top 25 words
| word | n |
|---|---|
| the | 33258 |
| i | 25459 |
| and | 22868 |
| a | 21338 |
| to | 18097 |
| it | 15898 |
| of | 14419 |
| is | 13309 |
| this | 11731 |
| for | 9676 |
| in | 9596 |
| my | 7815 |
| that | 7647 |
| but | 6761 |
| with | 6308 |
| not | 6094 |
| have | 6089 |
| you | 5882 |
| are | 5803 |
| was | 5677 |
| they | 5309 |
| as | 4996 |
| like | 4757 |
| on | 4725 |
| so | 4536 |
Takeaway: Without cleaning, the most frequent words are almost entirely stopwords (“the,” “i,” “and”), which carry little content on their own.
#frequent bigram count
df_bigrams <- df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% #repeat tokenization where token = bigram
dplyr::filter(!is.na(bigram)) %>%
tidyr::separate(bigram, c("word1", "word2"), sep = " ") %>%
dplyr::filter(!word1 %in% stopwords$word,
!word2 %in% stopwords$word) #remove stopwords
df_bigrams$word1 <- textstem::lemmatize_strings(df_bigrams$word1) #lemmatize words
df_bigrams$word2 <- textstem::lemmatize_strings(df_bigrams$word2) #lemmatize words
#paste both words in bigram together in new column
df_bigrams <- df_bigrams %>%
dplyr::mutate(bigram = paste(word1, word2, sep = " ")) %>%
select(-c(word1, word2)) %>%
count(bigram, sort = TRUE)
#top frequent bigrams table
knitr::kable(head(df_bigrams, n = 25))
| bigram | n |
|---|---|
| taste like | 459 |
| k cup | 425 |
| highly recommend | 322 |
| dog food | 321 |
| green tea | 319 |
| gluten free | 317 |
| grocery store | 294 |
| peanut butter | 238 |
| much good | 233 |
| dog love | 232 |
| year old | 230 |
| taste good | 217 |
| taste great | 211 |
| really like | 203 |
| dark chocolate | 202 |
| cat food | 176 |
| really good | 175 |
| can get | 171 |
| great taste | 171 |
| great product | 158 |
| just like | 154 |
| potato chip | 154 |
| good price | 152 |
| look like | 141 |
| good taste | 139 |
#compare high and low scored reviews
bigram_df_plot <- df %>%
left_join(., df_stars, by = c("review_id")) %>%
subset(!is.na(high_low_score)) %>%
dplyr::group_by(high_low_score) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
dplyr::filter(!is.na(bigram)) %>%
tidyr::separate(bigram, c("word1", "word2"), sep = " ") %>%
dplyr::filter(!word1 %in% stopwords$word,
!word2 %in% stopwords$word) #remove stopwords
bigram_df_plot$word1 <- textstem::lemmatize_strings(bigram_df_plot$word1) #lemmatize words
bigram_df_plot$word2 <- textstem::lemmatize_strings(bigram_df_plot$word2) #lemmatize words
#plot top bigrams in high and low scoring reviews
bigram_df_plot %>%
dplyr::mutate(bigram = paste(word1, word2, sep = " ")) %>%
select(-c(word1, word2)) %>%
count(bigram, sort = TRUE) %>%
top_n(n = 25) %>%
dplyr::mutate(bigram = reorder(bigram, n)) %>%
dplyr::ungroup() %>%
ggplot(aes(n, reorder_within(bigram, n, high_low_score))) +
geom_col(color = "gray", fill = "darkgray") +
labs(y = "Bigram", x = "Frequency ", title = "Amazon fine foods reviews top 25 most frequent bigrams",
subtitle = "Among high-scoring (5 star) and low-scoring (1 or 2 star) reviews") +
geom_text(aes(label = n), hjust = 1, colour = "white") +
theme_classic() +
facet_wrap(~high_low_score, scales = "free")
## Selecting by n
Takeaway: The top bigrams across all reviews are “gluten free,” “green tea,” “dog food,” “highly recommend,” and “k cup.” 5-star and 1- and 2-star reviews share several of these (e.g., “k cup,” “gluten free,” “dog food,” “green tea”), while 5-star reviews feature more positive bigrams (“highly recommend,” “great product,” “really good,” “taste great”) and 1- and 2-star reviews feature more negative bigrams (“taste like,” “look like,” “never buy”).
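Bigram counts can also be visualized as a word network, where edges connect words that frequently co-occur. A sketch, assuming the `df_bigrams` counts from above; note it uses the igraph and ggraph packages, which are not loaded at the top of this tutorial (a toy fallback is included for standalone runs):

```r
library(tidyverse)
library(igraph)  #assumed installed; not loaded above
library(ggraph)  #assumed installed; not loaded above

if (!exists("df_bigrams")) { #toy fallback so the sketch runs standalone
  df_bigrams <- data.frame(bigram = c("green tea", "dog food", "peanut butter"),
                           n = c(319, 321, 238))
}

set.seed(22)
bigram_graph <- df_bigrams %>%
  dplyr::filter(n > 100) %>% #keep only common pairs
  tidyr::separate(bigram, c("word1", "word2"), sep = " ") %>%
  igraph::graph_from_data_frame() #first two columns become edges, n an edge attribute

p <- ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) + #fainter edges = rarer pairs
  geom_node_point(color = "darkgray", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
p
```

Clusters in the resulting network (e.g., around "tea" or "dog") often mirror the LDA topics estimated below.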
#frequent bigrams with raw text
df_bigrams2 <- df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% #tokenize bigrams without further cleaning
dplyr::filter(!is.na(bigram)) %>%
tidyr::separate(bigram, c("word1", "word2"), sep = " ") %>%
dplyr::mutate(bigram = paste(word1, word2, sep = " ")) %>%
select(-c(word1, word2)) %>%
count(bigram, sort = TRUE)
#top frequent bigrams table
knitr::kable(head(df_bigrams2, n = 25))
| bigram | n |
|---|---|
| of the | 2788 |
| in the | 2330 |
| i have | 2143 |
| it is | 2055 |
| this is | 2029 |
| is a | 1662 |
| i was | 1482 |
| they are | 1459 |
| and i | 1370 |
| if you | 1363 |
| on the | 1329 |
| and the | 1174 |
| this product | 1170 |
| for a | 1140 |
| to be | 1140 |
| it was | 1130 |
| i am | 1090 |
| to the | 1068 |
| for the | 1048 |
| in a | 1033 |
| is the | 974 |
| the best | 970 |
| but i | 926 |
| a little | 908 |
| i would | 895 |
We can summarize the topics discussed across all reviews using Latent Dirichlet Allocation (LDA) topic modeling (Blei, Ng, & Jordan, 2003). There are many approaches to topic modeling, but LDA is a common one and well suited to providing an overview of review content here.
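The choice of 15 topics below is a judgment call. One common (if imperfect) heuristic is to fit models over a range of k and compare perplexity, where lower values indicate better fit; a sketch, assuming the document-term matrix `data_dtm` used below:

```r
library(topicmodels)

#heuristic: compare LDA perplexity across candidate topic counts (lower = better)
choose_k <- function(dtm, ks) {
  sapply(ks, function(k) {
    fit <- topicmodels::LDA(dtm, k = k, control = list(seed = 2025))
    topicmodels::perplexity(fit, dtm)
  })
}
#e.g., choose_k(data_dtm, c(5, 10, 15, 20)) for the DTM used below
```

In-sample perplexity tends to keep decreasing with k, so held-out perplexity or interpretability checks are also worth considering before settling on a value.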
#create a document-term matrix (DTM) from the cleaned, tokenized df_words
data_dtm <- df_words %>%
dplyr::count(review_id, word) %>%
dplyr::mutate(review_id = as.numeric(review_id)) %>%
tidytext::cast_dtm(document = review_id, term = word, value = n)
#estimate an LDA topic model with 15 topics
set.seed(2025)
reviews_lda <- topicmodels::LDA(data_dtm, k = 15, control = list(seed = 2025)) #run topic model with 15 topics
#get top 15 terms per topic
knitr::kable(topicmodels::get_terms(reviews_lda, 15))
| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10 | Topic 11 | Topic 12 | Topic 13 | Topic 14 | Topic 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| product | like | good | flavor | food | organic | tea | dog | order | sugar | chocolate | price | almond | coffee | use |
| use | taste | make | chip | cat | ingredient | green | treat | box | drink | bar | amazon | salt | cup | hot |
| cereal | can | use | good | eat | product | good | love | product | taste | good | buy | nut | good | sauce |
| good | good | cook | like | dog | fat | flavor | get | package | flavor | taste | store | seed | flavor | popcorn |
| hair | much | mix | taste | can | list | taste | chew | arrive | juice | like | good | blue | like | good |
| like | try | like | eat | much | food | like | one | good | fruit | flavor | find | good | taste | make |
| bottle | just | noodle | snack | good | much | drink | will | receive | like | make | can | snack | much | oz |
| shampoo | get | just | love | like | contain | much | give | cookie | water | much | product | healthy | one | just |
| much | think | taste | great | love | good | bag | good | one | sweet | will | order | much | strong | taste |
| dry | really | can | much | one | make | ginger | tooth | will | can | dark | much | taste | try | will |
| oatmeal | one | much | try | year | fiber | try | like | make | much | just | great | diamond | roast | much |
| will | bad | pasta | bag | old | protein | make | much | time | good | eat | get | blood | drink | add |
| oil | say | easy | peanut | get | rice | love | small | can | add | mix | love | keep | blend | spice |
| work | will | free | butter | baby | g | cup | can | gift | soda | love | ship | great | make | get |
| try | give | gluten | potato | will | gram | one | great | love | use | great | purchase | eat | bean | pop |
#visualize top 5 words per topic
tidytext::tidy(reviews_lda, matrix = "beta") %>%
dplyr::group_by(topic) %>%
top_n(5, beta) %>% #count words with highest beta per topic
dplyr::ungroup() %>%
ggplot(aes(x = beta, y = reorder_within(term, beta, topic), fill = as.factor(topic))) +
geom_col() +
scale_fill_manual(values = MetBrew_Tam) +
scale_color_manual(values = MetBrew_Tam) +
facet_wrap(~ topic, scales = "free") +
labs(title = "Top 5 words per LDA topic across Amazon fine food reviews subset",
x = "Probability of word belonging to topic (beta)", y = "Word") +
theme_classic() +
theme(legend.position = "none")
Takeaway: Topics largely correspond to product categories (e.g., coffee, tea, chocolate, dog treats) along with broader consumer experiences such as finding items, pricing, and shipping.
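Beta gives word-to-topic probabilities; the complementary gamma matrix gives topic proportions per review, which can be used to tag each review with its dominant topic. A sketch, assuming `reviews_lda` from above (with a toy fallback for standalone runs):

```r
library(tidyverse)
library(tidytext)

if (!exists("reviews_lda")) { #toy model so the sketch runs standalone
  library(tm)
  library(topicmodels)
  toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(
    c("apple good taste taste", "dog food good dog"))))
  reviews_lda <- topicmodels::LDA(toy_dtm, k = 2, control = list(seed = 1))
}

#gamma = estimated proportion of each review generated by each topic
review_topics <- tidytext::tidy(reviews_lda, matrix = "gamma") %>%
  dplyr::group_by(document) %>%
  dplyr::slice_max(gamma, n = 1) %>% #keep each review's dominant topic
  dplyr::ungroup()

dplyr::count(review_topics, topic, sort = TRUE) #reviews per dominant topic
```

Joining `review_topics` back to `df` by review ID would allow, e.g., comparing review scores across topics.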
#redoing DTM as shown above without text cleaning prior to tokenization
data_dtm2 <- df %>%
tidytext::unnest_tokens(word, text, token = "words") %>%
dplyr::filter(!is.na(word)) %>%
dplyr::count(review_id, word) %>%
dplyr::mutate(review_id = as.numeric(review_id)) %>%
tidytext::cast_dtm(document = review_id, term = word, value = n)
#estimate an LDA topic model with 15 topics
reviews_lda2 <- topicmodels::LDA(data_dtm2, k = 15, control = list(seed = 2025)) #run topic model
#get top 15 terms per topic
knitr::kable(topicmodels::get_terms(reviews_lda2, 15))
| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10 | Topic 11 | Topic 12 | Topic 13 | Topic 14 | Topic 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a | to | i | the | the | i | a | the | the | the | it | the | and | to | a |
| of | a | a | to | a | and | i | and | i | i | to | and | i | i | i |
| i | and | it | a | it | the | of | of | and | to | and | this | is | it | this |
| it | of | in | i | and | a | is | to | is | it | the | in | my | is | the |
| to | i | and | and | my | is | them | a | to | is | of | it | not | this | in |
| is | it | this | of | for | but | the | for | for | and | this | of | to | for | but |
| the | this | the | in | to | it | for | are | this | a | i | you | of | of | of |
| and | for | my | are | in | so | are | is | was | my | my | with | that | the | it |
| was | not | but | for | this | have | these | have | a | you | was | not | it | with | with |
| for | that | is | that | on | they | my | in | of | but | that | a | in | in | as |
| you | with | of | with | but | these | was | i | my | in | for | i | they | that | is |
| with | you | to | as | that | are | they | on | you | so | them | flavor | like | and | have |
| tea | food | that | so | with | that | and | with | that | that | is | is | you | these | so |
| but | in | like | have | like | to | so | you | like | at | but | as | this | as | for |
| in | dog | for | not | they | this | have | good | one | like | in | was | these | they | and |
Takeaway: Without stopword removal and lemmatization, the topics are dominated by function words (“the,” “i,” “a,” “and”) and are not interpretable, illustrating why these cleaning steps matter for topic modeling.
#get VADER compound sentiment scores for each review using minimally-cleaned text data, since VADER is compositional
df_vader <- df %>%
dplyr::group_by(review_id) %>%
dplyr::mutate(vader_output = vader::vader_df(text)) %>% #use vader package to assign sentiment values per review
dplyr::mutate(sentiment_vader = vader_output$compound) %>% #create new column for compound vader sentiment score
select(-c(vader_output)) %>% #remove extraneous vader output
dplyr::ungroup() %>%
as.data.frame()
#examine sentiment
knitr::kable(skim(df_vader))
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | factor.ordered | factor.n_unique | factor.top_counts | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | score | 3 | 0.9997000 | 3 | 3 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | text | 19 | 0.9981002 | 33 | 6608 | 0 | 9814 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | highly_reviewed | 0 | 1.0000000 | 15 | 19 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | review_id | 0 | 1.0000000 | NA | NA | NA | NA | NA | FALSE | 10000 | 311: 2, 4: 1, 16: 1, 45: 1 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | productId | 0 | 1.0000000 | NA | NA | NA | NA | NA | FALSE | 3525 | 358: 82, 949: 80, 756: 76, 682: 72 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | userId | 0 | 1.0000000 | NA | NA | NA | NA | NA | FALSE | 9280 | 484: 7, 495: 7, 527: 7, 165: 6 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | n_reviews | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | 14.5380462 | 19.0767863 | 1.000 | 2.000 | 5.000 | 22.000 | 82 | ▇▂▁▁▁ |
| numeric | review_wordcount | 19 | 0.9981002 | NA | NA | NA | NA | NA | NA | NA | NA | 84.6415548 | 82.7039978 | 6.000 | 35.000 | 60.000 | 103.000 | 1318 | ▇▁▁▁▁ |
| numeric | sentiment_vader | 21 | 0.9979002 | NA | NA | NA | NA | NA | NA | NA | NA | 0.6600209 | 0.4565488 | -0.982 | 0.601 | 0.859 | 0.944 | 1 | ▁▁▁▁▇ |
#histogram of sentiment
hist(df_vader$sentiment_vader)
#see what goes into compound sentiment breakdown
df_vader_breakdown <- df_vader %>%
dplyr::group_by(review_id) %>%
dplyr::mutate(vader_output = vader_df(text)) %>% #generate vader output
dplyr::mutate(prop_positive = vader_output$pos, #create columns showing proportion of each review that are positive, negative, and neutral
prop_neutral = vader_output$neu,
prop_negative = vader_output$neg) %>%
select(review_id, prop_positive, prop_neutral, prop_negative) %>%
dplyr::ungroup() %>%
pivot_longer(cols = c(prop_positive, prop_neutral, prop_negative), names_to = "sentiment_vader", values_to = "proportion") %>%
dplyr::mutate(sentiment_vader = factor(sentiment_vader, levels = c("prop_negative","prop_neutral","prop_positive"), labels = c("negative", "neutral", "positive"))) %>% #relabeling negative, neutral, and positive proportions
dplyr::filter(!(proportion < 0 | proportion > 1)) %>% #filter out miscalculated proportions
drop_na(proportion)
#plot proportion of each review that is negative, positive, and neutral
df_vader_breakdown %>%
ggplot(aes(x = as.factor(review_id), y = proportion, fill = as.factor(sentiment_vader))) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("red3", "orange2", "gold")) +
labs(title = "Proportions of negative, neutral, and positive VADER sentiment across reviews",
subtitle = "These data underlie the compound VADER sentiment values.",
x = "Review",
y = "Proportion",
fill = "Sentiment (VADER)") +
ylim(0,1) +
coord_flip() +
theme_classic() +
theme(legend.position = "bottom") +
theme(axis.text.x=element_blank())
Takeaway: Reviews are mostly positive or neutral in the language they use, according to VADER.
#plot association between vader linguistic sentiment and review score
df_vader %>%
drop_na(sentiment_vader, score) %>%
ggplot(aes(y = as.numeric(sentiment_vader), x = as.factor(score), color = as.factor(score), fill = as.factor(score))) +
geom_jitter(size = 1, color = "gray") +
geom_violin(alpha = 0.7) +
scale_fill_manual(values = MetBrew_Egypt) +
scale_color_manual(values = MetBrew_Egypt) +
geom_boxplot(width = 0.1, color = "white") +
labs(x = "Review Score", y = "Review linguistic sentiment (VADER)") +
theme_minimal() +
theme(legend.position = "none")
Takeaway: More positive linguistic sentiment via VADER is directionally associated with higher review scores.
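As an optional follow-up (still observational), this association could be quantified with a rank correlation between the compound sentiment score and the star rating; a sketch assuming `df_vader` from above, with a toy fallback for standalone runs:

```r
if (!exists("df_vader")) { #toy fallback so the sketch runs standalone
  df_vader <- data.frame(score = c("1.0", "3.0", "5.0", "5.0"),
                         sentiment_vader = c(-0.5, 0.2, 0.9, 0.8))
}

#spearman rank correlation between VADER compound sentiment and star rating
#(scores are stored as character, e.g. "5.0", so convert to numeric first)
cor.test(as.numeric(df_vader$score), df_vader$sentiment_vader,
         method = "spearman", exact = FALSE)
```

Spearman's rho is used because star ratings are ordinal, and `exact = FALSE` avoids tie warnings.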
#plot association between vader linguistic sentiment and review length
df_vader %>%
subset(review_wordcount <= (84.64+(3*82.70))) %>% #subset to remove review length outliers more than 3 SD from average review length
drop_na(sentiment_vader, review_wordcount) %>%
ggplot(aes(y = as.numeric(sentiment_vader), x = as.numeric(review_wordcount))) +
geom_jitter(size = 1, color = "gray") +
geom_smooth(method = "loess") +
labs(x = "Review Wordcount", y = "Review linguistic sentiment (VADER)") +
theme_minimal() +
theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'
Takeaway: Reviews have slightly more positive linguistic sentiment via VADER as they get longer, but are fairly positive in linguistic sentiment to begin with.
#select sentiment database - afinn
AFINN <- tidytext::get_sentiments("afinn")
#join sentiment values to review data by word matches
df_AFINN <- df_words %>%
inner_join(AFINN, by = c("word")) %>%
rename("sentiment_AFINN" = "value")
#examine sentiment
knitr::kable(skim(df_AFINN))
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | factor.ordered | factor.n_unique | factor.top_counts | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | score | 0 | 1 | 3 | 3 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | highly_reviewed | 0 | 1 | 15 | 19 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | word | 0 | 1 | 2 | 17 | 0 | 779 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | review_id | 0 | 1 | NA | NA | NA | NA | NA | FALSE | 9777 | 363: 65, 539: 65, 484: 57, 360: 55 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | productId | 0 | 1 | NA | NA | NA | NA | NA | FALSE | 3465 | 853: 502, 756: 458, 915: 449, 602: 424 | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | userId | 0 | 1 | NA | NA | NA | NA | NA | FALSE | 9070 | 590: 215, 412: 130, 135: 120, 527: 84 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | n_reviews | 0 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | 15.666661 | 19.184105 | 1 | 2 | 6 | 23 | 82 | ▇▂▁▁▁ |
| numeric | review_wordcount | 0 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | 146.064745 | 148.515108 | 10 | 56 | 100 | 181 | 1318 | ▇▁▁▁▁ |
| numeric | sentiment_AFINN | 0 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | 1.481144 | 1.792796 | -4 | 1 | 2 | 3 | 5 | ▁▂▂▇▁ |
#histogram of sentiment
hist(df_AFINN$sentiment_AFINN)
#see distribution of average AFINN sentiment across reviews
df_AFINN %>%
dplyr::group_by(review_id) %>%
dplyr::mutate(average_sentiment_AFINN = mean(sentiment_AFINN),
primary_sent_direction = ifelse(average_sentiment_AFINN > 0, "positive", "negative")) %>% #create columns for average sentiment per review and average direction positive or negative
dplyr::ungroup() %>%
select(c(review_id, average_sentiment_AFINN, primary_sent_direction)) %>%
distinct() %>%
ggplot(aes(x = average_sentiment_AFINN, y = reorder(as.factor(review_id), average_sentiment_AFINN), color = as.factor(primary_sent_direction))) + #plot average sentiment scores with color by average direction
geom_col() +
scale_color_manual(values = c("red3", "green3")) +
labs(title = "Average AFINN sentiment distribution across reviews",
y = "Review",
x = "Average AFINN sentiment") +
theme_classic() +
theme(axis.text.y=element_blank()) +
theme(legend.position = "none")
Takeaways: Reviews are much more positive than negative in their linguistic sentiment according to AFINN.
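To make the review-level averaging concrete, here is a toy example on an invented sentence (it mirrors the tokenize/join/average pipeline used above; get_sentiments("afinn") may prompt a one-time lexicon download via textdata):

```r
#toy illustration of review-level AFINN scoring on an invented sentence
library(dplyr)
library(tidytext)

tibble(review_id = 1, text = "great flavor but terrible packaging") %>%
  unnest_tokens(word, text) %>% #one row per word
  inner_join(get_sentiments("afinn"), by = "word") %>% #keep only words in the AFINN lexicon
  summarise(average_sentiment_AFINN = mean(value)) #positive and negative words offset
```

Note that words absent from the lexicon (e.g., "flavor", "packaging") drop out at the inner join, so the average reflects only sentiment-bearing words.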
#plot association between AFINN sentiment and review score
df_AFINN %>%
dplyr::group_by(review_id) %>%
dplyr::mutate(average_sentiment_AFINN = mean(sentiment_AFINN)) %>%
dplyr::ungroup() %>%
select(c(review_id, average_sentiment_AFINN, score)) %>%
distinct() %>%
ggplot(aes(y = as.numeric(average_sentiment_AFINN), x = as.factor(score), color = as.factor(score), fill = as.factor(score))) +
geom_jitter(size = 1, color = "gray") +
geom_violin(alpha = 0.7) +
scale_fill_manual(values = MetBrew_Egypt) +
scale_color_manual(values = MetBrew_Egypt) +
geom_boxplot(width = 0.1, color = "white") +
labs(x = "Review Score", y = "Average linguistic sentiment (AFINN)") +
theme_minimal() +
theme(legend.position = "none")
Takeaways: More positive linguistic sentiment via AFINN is directionally associated with slightly better review scores.
#plot association between AFINN sentiment and review length
df_AFINN %>%
subset(review_wordcount <= (84.64+(3*82.70))) %>% #subset to remove review length outliers more than 3 SD from average review length
dplyr::group_by(review_id) %>%
dplyr::mutate(average_sentiment_AFINN = mean(sentiment_AFINN)) %>%
dplyr::ungroup() %>%
drop_na(average_sentiment_AFINN, review_wordcount) %>%
ggplot(aes(y = as.numeric(average_sentiment_AFINN), x = as.numeric(review_wordcount))) +
geom_jitter(size = 1, color = "gray") +
geom_smooth(method = "loess") +
labs(x = "Review Wordcount", y = "Review linguistic sentiment (AFINN)") +
theme_minimal() +
theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'
Takeaways: Reviews have similar linguistic sentiment via AFINN regardless of length.
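If you want a numeric check on this takeaway rather than relying on the smoother alone, a simple correlation (a sketch, assuming df_AFINN as built above) quantifies the association:

```r
#numeric check: correlate review length with review-level AFINN sentiment
df_AFINN %>%
  dplyr::group_by(review_id) %>%
  dplyr::summarise(average_sentiment_AFINN = mean(sentiment_AFINN),
                   review_wordcount = dplyr::first(review_wordcount)) %>%
  with(cor.test(average_sentiment_AFINN, review_wordcount)) #a near-zero r supports the takeaway
```
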
In this workshop, we:
Set up a text analysis plan in R
Learned about how and why to use different text cleaning steps
Examined common experiences and pain points discussed in a random subset of Amazon fine foods reviews (McAuley & Leskovec, 2013) using basic text analyses and visualizations, including:
Evaluation of most frequent words, bigrams, and associations between words in all reviews and those with high vs. low stars
Summarization of themes using topic modeling (LDA)
Examination of linguistic sentiment via sentiment analysis
Developed meaningful business insights from a subset of unstructured review text data
Remember to carefully evaluate data sources and quality, clean text data in line with best practices, choose analyses best suited to your use case, and interpret results in context, keeping in mind that they are often observational.
Happy text analyzing!
McAuley, J., & Leskovec, J. (2013). From amateurs to connoisseurs: Modeling the evolution of user expertise through online reviews. Proceedings of the 22nd International Conference on World Wide Web (WWW 2013).
Porter, M. F. (2001). Snowball: A language for stemming algorithms. https://snowballstem.org.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal Of Machine Learning Research, 3(4/5), 993-1022. https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
Hutto, C., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. Proceedings of the International AAAI Conference on Web and Social Media, 8(1), 216-225. https://doi.org/10.1609/icwsm.v8i1.14550
Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages. CEUR Workshop Proceedings 718, 93-98. http://arxiv.org/abs/1103.2903
R Core Team (2025). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686.
Wickham H (2023). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.1. https://CRAN.R-project.org/package=stringr.
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. R package version 2.1.5. https://CRAN.R-project.org/package=skimr.
Silge J, Robinson D (2016). "tidytext: Text Mining and Analysis Using Tidy Data Principles in R." Journal of Open Source Software, 1(3). https://doi.org/10.21105/joss.00037.
Rinker, T. W. (2018). textclean: Text Cleaning Tools. R package version 0.9.3. Buffalo, New York. https://github.com/trinker/textclean.
Fellows I (2018). wordcloud: Word Clouds. R package version 2.6. https://CRAN.R-project.org/package=wordcloud.
Feinerer I, Hornik K (2025). tm: Text Mining Package. R package version 0.7-16. https://CRAN.R-project.org/package=tm.
Roehrick K (2020). vader: Valence Aware Dictionary and sEntiment Reasoner (VADER). R package version 0.2.1. https://CRAN.R-project.org/package=vader.
Grün B, Hornik K (2011). "topicmodels: An R Package for Fitting Topic Models." Journal of Statistical Software, 40(13), 1-30. https://doi.org/10.18637/jss.v040.i13.
Mills BR (2022). MetBrewer: Color Palettes Inspired by Works at the Metropolitan Museum of Art. R package version 0.2.0. https://CRAN.R-project.org/package=MetBrewer.
Posit team (2025). RStudio: Integrated Development Environment for R. Posit Software, PBC, Boston, MA. URL http://www.posit.co/.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.5.0 (2025-04-11)
## os macOS Sequoia 15.6
## system aarch64, darwin20
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2025-12-12
## pandoc 3.4 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
## quarto 1.6.42 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/quarto
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.5.0)
## bslib 0.9.0 2025-01-30 [1] CRAN (R 4.5.0)
## cachem 1.1.0 2024-05-16 [1] CRAN (R 4.5.0)
## cli 3.6.5 2025-04-23 [1] CRAN (R 4.5.0)
## data.table 1.17.2 2025-05-12 [1] CRAN (R 4.5.0)
## dichromat 2.0-0.1 2022-05-02 [1] CRAN (R 4.5.0)
## digest 0.6.37 2024-08-19 [1] CRAN (R 4.5.0)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.5.0)
## evaluate 1.0.3 2025-01-10 [1] CRAN (R 4.5.0)
## farver 2.1.2 2024-05-13 [1] CRAN (R 4.5.0)
## fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.5.0)
## forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.5.0)
## fs 1.6.6 2025-04-12 [1] CRAN (R 4.5.0)
## generics 0.1.4 2025-05-09 [1] CRAN (R 4.5.0)
## ggplot2 * 4.0.0 2025-09-11 [1] CRAN (R 4.5.0)
## glue 1.8.0 2024-09-30 [1] CRAN (R 4.5.0)
## gtable 0.3.6 2024-10-25 [1] CRAN (R 4.5.0)
## hms 1.1.3 2023-03-21 [1] CRAN (R 4.5.0)
## htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.5.0)
## janeaustenr 1.0.0 2022-08-26 [1] CRAN (R 4.5.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.5.0)
## jsonlite 2.0.0 2025-03-27 [1] CRAN (R 4.5.0)
## knitr 1.50 2025-03-16 [1] CRAN (R 4.5.0)
## koRpus 0.13-8 2021-05-17 [1] CRAN (R 4.5.0)
## koRpus.lang.en 0.1-4 2020-10-24 [1] CRAN (R 4.5.0)
## labeling 0.4.3 2023-08-29 [1] CRAN (R 4.5.0)
## lattice 0.22-6 2024-03-20 [1] CRAN (R 4.5.0)
## lexicon 1.2.1 2019-03-21 [1] CRAN (R 4.5.0)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.5.0)
## lubridate * 1.9.4 2024-12-08 [1] CRAN (R 4.5.0)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.5.0)
## Matrix 1.7-3 2025-03-11 [1] CRAN (R 4.5.0)
## MetBrewer * 0.2.0 2022-03-21 [1] CRAN (R 4.5.0)
## mgcv 1.9-1 2023-12-21 [1] CRAN (R 4.5.0)
## mnormt 2.1.1 2022-09-26 [1] CRAN (R 4.5.0)
## modeltools 0.2-24 2025-05-02 [1] CRAN (R 4.5.0)
## nlme 3.1-168 2025-03-31 [1] CRAN (R 4.5.0)
## NLP * 0.3-2 2024-11-20 [1] CRAN (R 4.5.0)
## pillar 1.11.0 2025-07-04 [1] CRAN (R 4.5.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.5.0)
## plyr 1.8.9 2023-10-02 [1] CRAN (R 4.5.0)
## psych * 2.5.3 2025-03-21 [1] CRAN (R 4.5.0)
## purrr * 1.0.4 2025-02-05 [1] CRAN (R 4.5.0)
## qdapRegex 0.7.10 2025-03-24 [1] CRAN (R 4.5.0)
## R6 2.6.1 2025-02-15 [1] CRAN (R 4.5.0)
## rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.5.0)
## RColorBrewer * 1.1-3 2022-04-03 [1] CRAN (R 4.5.0)
## Rcpp 1.0.14 2025-01-12 [1] CRAN (R 4.5.0)
## readr * 2.1.5 2024-01-10 [1] CRAN (R 4.5.0)
## repr 1.1.7 2024-03-22 [1] CRAN (R 4.5.0)
## reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.5.0)
## rlang 1.1.6 2025-04-11 [1] CRAN (R 4.5.0)
## rmarkdown 2.29 2024-11-04 [1] CRAN (R 4.5.0)
## rstudioapi 0.17.1 2024-10-22 [1] CRAN (R 4.5.0)
## S7 0.2.0 2024-11-07 [1] CRAN (R 4.5.0)
## sass 0.4.10 2025-04-11 [1] CRAN (R 4.5.0)
## scales 1.4.0 2025-04-24 [1] CRAN (R 4.5.0)
## sessioninfo 1.2.3 2025-02-05 [1] CRAN (R 4.5.0)
## skimr * 2.1.5 2022-12-23 [1] CRAN (R 4.5.0)
## slam 0.1-55 2024-11-13 [1] CRAN (R 4.5.0)
## SnowballC 0.7.1 2023-04-25 [1] CRAN (R 4.5.0)
## stringi 1.8.7 2025-03-27 [1] CRAN (R 4.5.0)
## stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.5.0)
## sylly 0.1-6 2020-09-20 [1] CRAN (R 4.5.0)
## sylly.en 0.1-3 2018-03-19 [1] CRAN (R 4.5.0)
## syuzhet 1.0.7 2023-08-11 [1] CRAN (R 4.5.0)
## textclean * 0.9.3 2018-07-23 [1] CRAN (R 4.5.0)
## textdata 0.4.5 2024-05-28 [1] CRAN (R 4.5.0)
## textshape 1.7.5 2024-04-01 [1] CRAN (R 4.5.0)
## textstem 0.1.4 2018-04-09 [1] CRAN (R 4.5.0)
## tibble * 3.3.0 2025-06-08 [1] CRAN (R 4.5.0)
## tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.5.0)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.5.0)
## tidytext * 0.4.2 2024-04-10 [1] CRAN (R 4.5.0)
## tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.5.0)
## timechange 0.3.0 2024-01-18 [1] CRAN (R 4.5.0)
## tm * 0.7-16 2025-02-19 [1] CRAN (R 4.5.0)
## tokenizers 0.3.0 2022-12-22 [1] CRAN (R 4.5.0)
## topicmodels * 0.2-17 2024-08-14 [1] CRAN (R 4.5.0)
## tzdb 0.5.0 2025-03-15 [1] CRAN (R 4.5.0)
## vader * 0.2.1 2020-09-07 [1] CRAN (R 4.5.0)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.5.0)
## withr 3.0.2 2024-10-28 [1] CRAN (R 4.5.0)
## wordcloud * 2.6 2018-08-24 [1] CRAN (R 4.5.0)
## xfun 0.52 2025-04-02 [1] CRAN (R 4.5.0)
## xml2 1.3.8 2025-03-14 [1] CRAN (R 4.5.0)
## yaml 2.3.10 2024-07-26 [1] CRAN (R 4.5.0)
##
## [1] /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library
## * ── Packages attached to the search path.
##
## ──────────────────────────────────────────────────────────────────────────────